Project 5 - Feature Selection, Model Selection and Tuning

Title: Credit Card Users Churn Prediction

Author: Pankaj Singh


Domain

Market analytics/Customer prediction

Project Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Data Description

The data contains characteristics of the clients.

Customer Information

Project Deliverables

Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand their reasons for leaving, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance.

Learning Objective

Import Libraries

Load data & get a first glimpse

CLIENTNUM does not seem to provide any useful information, so we will drop the CLIENTNUM column.

Observations:

Exploratory Data Analysis

Observations from EDA of numerical data

Observations on EDA of categorical variables

Outlier identification and treatment

First, we will identify which columns have a significant number of outliers, based on the percentage of data points that fall outside the boxplot whisker range.
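The check described above can be sketched without any third-party library. This is only an illustration under stated assumptions: the helper name `pct_outside_whiskers`, the conventional whisker factor of 1.5 times the interquartile range, and the toy column are all hypothetical and not taken from the notebook.

```python
from statistics import quantiles


def pct_outside_whiskers(values, k=1.5):
    """Return the percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outside = sum(1 for v in values if v < lower or v > upper)
    return 100.0 * outside / len(values)


# Hypothetical toy column; a real run would loop over the numeric columns.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(pct_outside_whiskers(sample))  # 10.0
```

Columns whose percentage is large are the candidates for outlier treatment; where to draw that line is a judgment call that depends on the data.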

Missing value treatment

We have identified the following missing values in the dataset:

  1. Marital_Status has 749 missing values
  2. Education_Level has 1519 missing values
  3. Income_Category has 1112 values recorded as 'abc', which need to be replaced
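One possible treatment is sketched below: the 'abc' placeholder is first converted to a proper missing marker, and missing entries are then filled with the most frequent observed value. The choice of most-frequent imputation is an assumption made for illustration, since the text does not state which strategy was used, and the helper names and toy values are hypothetical.

```python
from collections import Counter


def mark_placeholder_missing(values, placeholder="abc"):
    """Turn a placeholder string into None so it is treated as missing."""
    return [None if v == placeholder else v for v in values]


def impute_most_frequent(values):
    """Fill None entries with the most frequent observed (non-missing) value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical toy column standing in for the income column.
income = ["low", "abc", "high", "low", "abc"]
cleaned = impute_most_frequent(mark_placeholder_missing(income))
print(cleaned)  # ['low', 'low', 'high', 'low', 'low']
```

To avoid leaking information, the fill value would normally be learned from the training split only and then reused on the validation and test splits.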

Splitting dataset, oversampled and undersampled training sets

Here we will split our dataset into train, validation, and test sets.
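A dependency-free sketch of this step follows. It assumes a 60/20/20 stratified split and plain random over- and undersampling; the notebook may well use different ratios or a different resampling technique, and the function names `stratified_split` and `resample` are hypothetical. Only the training indices are resampled, so the validation and test sets keep the original class balance.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=1):
    """Return (train, val, test) index lists that keep class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


def resample(indices, labels, mode, seed=1):
    """Balance classes by random oversampling ("over") or undersampling ("under")."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    sizes = [len(members) for members in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            # Keep `target` rows; this drops rows only when undersampling.
            balanced += rng.sample(members, target)
        else:
            # Duplicate randomly chosen minority rows up to the target size.
            balanced += members + rng.choices(members, k=target - len(members))
    return balanced


# Hypothetical imbalanced target: 20 positives (attrited) and 80 negatives.
labels = [1] * 20 + [0] * 80
train, val, test = stratified_split(labels)
over = resample(train, labels, "over")
under = resample(train, labels, "under")
print(len(train), len(val), len(test), len(over), len(under))  # 60 20 20 96 24
```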

Model Building

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that a customer will keep the credit card when they actually drop it: the bank loses an existing customer and hence business.

  2. Predicting that a customer will cancel the credit card when they actually stay: the bank spends resources on retention offers for customers who did not need them.

Which case is more important?

How do we reduce this loss, i.e. how do we reduce False Negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
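A minimal sketch of such helpers is shown below, written in plain Python so that the arithmetic is visible. The function names are hypothetical, and the sketch assumes that the positive class (label 1) marks a customer who leaves. Recall, TP / (TP + FN), is the metric that falls as False Negatives increase.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            counts["tp" if predicted == positive else "fn"] += 1
        else:
            counts["fp" if predicted == positive else "tn"] += 1
    return counts


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, precision and F1 (assumes non-empty input)."""
    c = confusion_counts(y_true, y_pred, positive)
    total = sum(c.values())
    actual_pos = c["tp"] + c["fn"]
    predicted_pos = c["tp"] + c["fp"]
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical labels: 3 customers actually left, the model caught 2 of them.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # {'tp': 2, 'fn': 1, 'fp': 1, 'tn': 4}
print(classification_metrics(y_true, y_pred))
```

Calling the same two functions on the training and validation predictions of every model keeps the comparisons consistent.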

Here we will build a basic logistic regression model with standard scaler
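To illustrate what the standard scaler step contributes, here is a small dependency-free sketch. The class `SimpleStandardScaler` and the toy rows are hypothetical; the notebook presumably relies on a library implementation instead. The point it shows is that the mean and standard deviation are learned from the training rows only and then reused unchanged on validation rows.

```python
from statistics import fmean, pstdev


class SimpleStandardScaler:
    """Scale each column to zero mean and unit (population) standard deviation."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [fmean(col) for col in columns]
        # A constant column has zero spread; use 1.0 to avoid dividing by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Hypothetical toy data: two features, the second one constant in training.
train_rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
val_rows = [[2.0, 12.0]]

scaler = SimpleStandardScaler().fit(train_rows)  # statistics from train only
scaled_train = scaler.transform(train_rows)
scaled_val = scaler.transform(val_rows)          # reuses the training statistics
print(scaled_val)  # [[0.0, 2.0]]
```

Wrapping the scaler and the estimator in a single pipeline object gives the same guarantee automatically during cross-validation, because the scaler is refit on each training fold.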

Here we will build a basic decision tree model with standard scaler

Here we will build a basic random forest classifier model with standard scaler

Here we will build a basic bagging classifier model with standard scaler

Here we will build a basic gradient boosting classifier with standard scaler

Here we will build a basic AdaBoost classifier with standard scaler

Here we will build a basic XGBoost classifier with standard scaler

Here we will build a basic logistic regression model with standard scaler trained on oversampled data

Here we will build a basic decision tree model with standard scaler trained on oversampled data

Here we will build a basic random forest classifier model with standard scaler trained on oversampled data

Here we will build a basic bagging classifier model with standard scaler trained on oversampled data

Here we will build a basic gradient boosting classifier with standard scaler trained on oversampled data

Here we will build a basic AdaBoost classifier with standard scaler trained on oversampled data

Here we will build a basic XGBoost classifier with standard scaler trained on oversampled data

Here we will build a basic logistic regression model with standard scaler trained on undersampled data

Here we will build a basic decision tree model with standard scaler trained on undersampled data

Here we will build a basic random forest classifier model with standard scaler trained on undersampled data

Here we will build a basic bagging classifier model with standard scaler trained on undersampled data

Here we will build a basic gradient boosting classifier with standard scaler trained on undersampled data

Here we will build a basic AdaBoost classifier with standard scaler trained on undersampled data

Here we will build a basic XGBoost classifier with standard scaler trained on undersampled data

Best parameters are {'xgbclassifier__subsample': 0.9, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__reg_lambda': 5, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__max_depth': 1, 'xgbclassifier__learning_rate': 0.01, 'xgbclassifier__gamma': 1} with CV score=1.0

CPU times: user 3.27 s, sys: 267 ms, total: 3.53 s
Wall time: 17min 13s

Based on our analysis and the details of the Gradient Boosting classifier, the most important factors for predicting attrition of an existing customer are:

  1. Total_trans_ct
  2. Total_trans_amt
  3. Total_revolving_bal
  4. Total_amt_chng_q4_q1
  5. Total_ct_chng_q4_q1

Based on the SHAP values of the variables that we have found to be important:

  1. Total_trans_ct: A higher total transaction count is a strong indicator that the customer will continue with the bank. The exact value matters less; as long as the customer is using the credit card for transactions, they are likely to stick with it.
  2. Total_trans_amt: Smaller transaction amounts are an indicator of customers dropping their credit cards.
  3. Total_revolving_bal: A smaller revolving balance indicates that the customer is planning to continue with the credit card, and a higher value indicates otherwise.
  4. Total_amt_chng_q4_q1: In general, a higher change in transaction amount from Q4 to Q1 indicates attrition.
  5. Total_ct_chng_q4_q1: In general, a higher change in transaction count from Q4 to Q1 indicates attrition.

Apart from the above features, a few other features to pay attention to are:

  1. Contacts_count_12_mon: A higher number of contacts in the last 12 months is a strong sign of attrition.
  2. Total_months_inactive: A higher number of inactive months is also a sign that the customer might drop their card.
  3. Dependent_count: Although the effect is not very strong, a higher dependent count can indicate that the customer might give up their card.
  4. Marital_status_married: Again, not a strong correlation, but married customers usually stick with their credit card.

Creating a Pipeline

Summary and Business Recommendations